Skip to content

Add 5xx retry policy for transient server errors#26821

Merged
simorenoh merged 7 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/fix-5xx-retry-behavior
Jun 1, 2026
Merged

Add 5xx retry policy for transient server errors#26821
simorenoh merged 7 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/fix-5xx-retry-behavior

Conversation

@andrewmathew1
Copy link
Copy Markdown
Contributor

Description

Adds a retry policy in the clientRetryPolicy for 500, 502, and 504 responses. Only read operations are retried; writes are surfaced to the caller immediately.

The retry budget is one in-region retry followed by one cross-region retry. The cross-region retry only fires when cross-region retries are enabled and a preferred location is available to fail over to. After both attempts are exhausted (or skipped), the original 5xx response is returned wrapped as a non-retriable error, matching the pattern already used for 503 / 404 / 403 in this file.

Fixes #25639.

Adds a retry policy for 500, 502, and 504 responses in the azcosmos
client retry pipeline. Only read operations are retried, which is
consistent with the .NET, Java, and Python Cosmos SDKs.

The retry budget is one in-region retry followed by one cross-region
retry. The cross-region retry only fires when cross-region retries are
enabled and a preferred location is available to fail over to.

Fixes Azure#25639.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 03:11
@andrewmathew1 andrewmathew1 requested a review from a team as a code owner May 20, 2026 03:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds transient 5xx retry handling to the Cosmos client retry policy so read requests can recover from certain server-side failures while ensuring write requests are not retried.

Changes:

  • Added retry handling for HTTP 500, 502, and 504 in clientRetryPolicy, with an in-region retry followed by an optional cross-region retry.
  • Introduced serverErrorRetryCount to track the 5xx retry budget independently of other retry counters.
  • Added unit tests for read/write behavior on these 5xx responses and documented the change in the package changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
sdk/data/azcosmos/cosmos_client_retry_policy.go Implements 5xx retry logic and tracking in the client retry policy.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go Adds tests validating read retries and ensuring writes are not retried for 5xx.
sdk/data/azcosmos/CHANGELOG.md Notes the new 5xx retry behavior in the unreleased changelog.

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
andrewmathew1 and others added 2 commits May 19, 2026 23:26
Adds tests covering scenarios the original PR did not exercise:

* In-region retry succeeds without a cross-region attempt.
* Cross-region retries disabled - only the in-region retry runs.
* Exhausted retries surface `*azcore.ResponseError` with the
  original status code.
* 501 (Not Implemented) is not retried.
* 500 interleaved with 503 retries correctly across regions.

Together with the existing tests this brings
`attemptRetryOnServerError` and `shouldRetryStatus` to 100%
statement coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses review feedback on the 5xx retry policy. Previously the
cross-region retry was gated only on `enableCrossRegionRetries` and
the length of `preferredLocations`, so single-region accounts (or
configurations where the only resolved read endpoint matches the
current endpoint) would burn the cross-region retry on the same URL
without actually failing over.

The cross-region retry now also requires more than one resolved read
endpoint in the location cache. The mock location-cache helper used in
tests is updated to populate `readEndpoints` / `writeEndpoints` per
available location so existing cross-region retry tests continue to
exercise the failover path, and a new test pins the single-endpoint
behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icts

Re-applied the 5xx (500/502/504) read-retry additions on top of upstream's
refactored clientRetryPolicy (cross-region connection-error failover + 408
handling): serverErrorRetryCount field, maxServerErrorRetryCount, the 5xx
switch case with advanceLocation gating, shouldRetryStatus entries, and
attemptRetryOnServerError. Added locationCache.readEndpointCount() so the
cross-region 5xx guard reads readEndpoints under RLock. Re-added the
server-error tests, injecting multiple read endpoints inline (via
multiReadEndpointLC) instead of mutating the shared CreateMockLC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@andrewmathew1 andrewmathew1 requested a review from a team as a code owner June 1, 2026 15:12
Comment thread .github/workflows/event-processor.yml Outdated
Comment thread .github/workflows/event-processor.yml Outdated
Comment thread sdk/data/azcosmos/CHANGELOG.md Outdated
Comment thread sdk/data/azcosmos/cosmos_client_retry_policy_test.go
andrewmathew1 and others added 3 commits June 1, 2026 15:27
Co-authored-by: Simon Moreno <30335873+simorenoh@users.noreply.github.com>
Address PR review feedback:
- Add TestReadServerError_CrossRegionRoutesToDifferentEndpoint, which backs the
  in-region and next-preferred read endpoints with two distinct mock servers and
  a host-routing transport. It asserts the first two attempts hit the in-region
  endpoint and the cross-region retry hits the second, distinct endpoint --
  proving failover changes the request target, not just retry counters.
- Update the 5xx CHANGELOG entry to link the PR instead of the issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conflict-resolution merge had reverted this workflow file to the PR
branch's older unquoted GITHUB_ENV form to avoid a workflow-scope push
restriction, which left the PR showing an unintended diff. Restore it to
upstream/main verbatim so the PR no longer modifies the workflow file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@andrewmathew1 andrewmathew1 force-pushed the andrewmathew1/fix-5xx-retry-behavior branch from 10c56ee to e20bea0 Compare June 1, 2026 19:57
Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR: #26820 we changed the status codes list passed in to exclude 429s so that we would own all the logic on our end only and exclude Core's logic. Do we need to do the same for the 500/502/504 in this one? - EDIT: Nevermind, just realized we wrap in non-retryable so should be fine.

Copy link
Copy Markdown
Member

@tvaron3 tvaron3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@simorenoh simorenoh merged commit bea393e into Azure:main Jun 1, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] 5xx retry policy

4 participants